You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/28 13:50:41 UTC

[GitHub] [spark] peter-toth opened a new pull request #31682: [SPARK-34545] Fix row comparisons

peter-toth opened a new pull request #31682:
URL: https://github.com/apache/spark/pull/31682


   ### What changes were proposed in this pull request?
   
   This PR fixes equality check of `Row` and `InternalRow`.
   
   ### Why are the changes needed?
   Otherwise rows containing `[1]` and `[1.0]` would be `true` and this can cause nasty issues like in https://issues.apache.org/jira/browse/SPARK-34545 description:
   
   ```
   >>> from pyspark.sql.functions import udf
   >>> from pyspark.sql.types import *
   >>> 
   >>> def udf1(data_type):
           def u1(e):
               return e[0]
           return udf(u1, data_type)
   >>> 
   >>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
   >>> 
   >>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
   >>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
   >>> 
   >>> df.select("c3").show()
   +---+
   | c3|
   +---+
   |1.0|
   +---+
   
   >>> df.select("c4").show()
   +---+
   | c4|
   +---+
   |  1|
   +---+
   
   >>> df.select("c3", "c4").show()
   +---+----+
   | c3|  c4|
   +---+----+
   |1.0|null|
   +---+----+
   ```
   This is due to `Pickler` in `BatchEvalPythonExec` uses memoization.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, fixes a correctness issue.
   
   ### How was this patch tested?
   Added new UT + manual tests.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r588085701



##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -131,7 +132,8 @@ private[spark] object SerDeUtil extends Logging {
   }
 
   private def checkPickle(t: (Any, Any)): (Boolean, Boolean) = {
-    val pickle = new Pickler
+    val pickle = new Pickler(/* useMemo = */ true,
+      /* valueCompare = */ false)

Review comment:
       ditto.
   ```scala
   -    val pickle = new Pickler(/* useMemo = */ true,
   -      /* valueCompare = */ false)
   +    val pickle = new Pickler(/* useMemo = */ true, /* valueCompare = */ false)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789533697


   **[Test build #135700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135700/testReport)** for PR 31682 at commit [`066f5a2`](https://github.com/apache/spark/commit/066f5a217ae963dc5ef6d9e0565db739c92f7be5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787892351


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40175/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788953968


   also cc @HyukjinKwon @ueshin 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] ueshin commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

ueshin commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r587691527



##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -78,7 +78,7 @@ private[spark] object SerDeUtil extends Logging {
    * Choose batch size based on size of objects
    */
   private[spark] class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
-    private val pickle = new Pickler()
+    private val pickle = new Pickler(true, false)

Review comment:
       cool, thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790129982


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40304/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790588665


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40334/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789568735


   **[Test build #135701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135701/testReport)** for PR 31682 at commit [`94be320`](https://github.com/apache/spark/commit/94be320ad32645cc834ae9dd7f1156818c977a10).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789535098


   **[Test build #135700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135700/testReport)** for PR 31682 at commit [`066f5a2`](https://github.com/apache/spark/commit/066f5a217ae963dc5ef6d9e0565db739c92f7be5).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787465869


   **[Test build #135556 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135556/testReport)** for PR 31682 at commit [`41dddb6`](https://github.com/apache/spark/commit/41dddb6da33db089c65d6418d9fb59d05226b19b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790062053


   **[Test build #135722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135722/testReport)** for PR 31682 at commit [`bb31231`](https://github.com/apache/spark/commit/bb312316f9de7739997d74d03dddcdb61bf74ec1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790062053


   **[Test build #135722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135722/testReport)** for PR 31682 at commit [`bb31231`](https://github.com/apache/spark/commit/bb312316f9de7739997d74d03dddcdb61bf74ec1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789533697


   **[Test build #135700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135700/testReport)** for PR 31682 at commit [`066f5a2`](https://github.com/apache/spark/commit/066f5a217ae963dc5ef6d9e0565db739c92f7be5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788883607


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135658/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r588085881



##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -182,7 +184,8 @@ private[spark] object SerDeUtil extends Logging {
       if (batchSize == 0) {
         new AutoBatchedPickler(cleaned)
       } else {
-        val pickle = new Pickler
+        val pickle = new Pickler(/* useMemo = */ true,
+          /* valueCompare = */ false)

Review comment:
       ditto.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-792664512


   Thanks for the review all.
   
   @srowen, I've opened a backport PR to 3.0: https://github.com/apache/spark/pull/31778


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r587330100



##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -78,7 +78,7 @@ private[spark] object SerDeUtil extends Logging {
    * Choose batch size based on size of objects
    */
   private[spark] class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
-    private val pickle = new Pickler()
+    private val pickle = new Pickler(true, false)

Review comment:
       Actually, there is a java way to do it `func(/*arg-name=*/ value)`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789551870


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40282/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-792397813


   Merged to master


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790543591


   **[Test build #135751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135751/testReport)** for PR 31682 at commit [`279f536`](https://github.com/apache/spark/commit/279f536167387ab21a56883f4d8827df9b03aef3).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788889699


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40239/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790129982


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40304/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r586187707



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
##########
@@ -46,7 +46,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     val needConversion = dataTypes.exists(EvaluatePython.needConversionInPython)
 
     // enable memo iff we serialize the row with schema (schema and class should be memorized)
-    val pickle = new Pickler(needConversion)
+    val pickle = new Pickler(needConversion, false)

Review comment:
       Could you leave some comments about why we need `valueCompare=false`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787481685


   **[Test build #135556 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135556/testReport)** for PR 31682 at commit [`41dddb6`](https://github.com/apache/spark/commit/41dddb6da33db089c65d6418d9fb59d05226b19b).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787493494


   **[Test build #135565 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135565/testReport)** for PR 31682 at commit [`d5ac2fe`](https://github.com/apache/spark/commit/d5ac2fe33ceab8f8fbf80c940736f89e67c460e5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787510256


   **[Test build #135566 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135566/testReport)** for PR 31682 at commit [`6d66ef3`](https://github.com/apache/spark/commit/6d66ef3aab4db67af532ecdd9559e469b1c7ef05).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787500650


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40146/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789615834


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40283/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790597093


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40334/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787887068


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40175/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r586644055



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
##########
@@ -46,7 +46,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     val needConversion = dataTypes.exists(EvaluatePython.needConversionInPython)
 
     // enable memo iff we serialize the row with schema (schema and class should be memorized)
-    val pickle = new Pickler(needConversion)
+    val pickle = new Pickler(needConversion, false)

Review comment:
       > so we need to disable this feature explicitly (`valueCompare=false`).
   
   I think it is not disabled, but just not compare with value when memorizing?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788242330


   > > I think this is because Spark 2.4 uses pyrolite 4.13 where it looks the value comparison feature is 
   > > I think another option would be to disable the valueCompare explicitly.
   > 
   > Yea, I think so, too. Could you try `valueCompare=false` then check performance changes just in case?
   
   Looks like `valueCompare` as true actually isn't for performance gain. But it's also hard to tell because reduced pickle size may improve performance too. I guess performance shouldn't be an issue here.
   
   https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L86-L89
   
   ```java
   /**
    * When memoizing, compare objects by value. This saves pickle size, but can slow down pickling.
    * Also, it should only be used if the object graph is immutable. Unused if useMemo is false.
    */
    protected boolean valueCompare=true;
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788247450


   > My idea is to let one `Pickler` instance only handle data of the same schema.
   > 
   > IIUC the Python UDF operator needs to send the input (values of (c1, c2)) from JVM to Python, run the UDF, and send back the UDF result (values of (c3, c4)) from Python to JVM. Since the `Pickler` instance is used to serialize both the input and output data, the bug happens. Do I understand it correctly?
   
   No sorry, the issue is that the `Pickler` instance in JVM that serializes the input data `(c1, c2)` = `((1.0, 1.0), (1, 1))` serializes it as if it were `((1.0, 1.0), (1.0, 1.0))` (i.e. sends the serialized data something like `((1.0, 1.0), <some short (hash?) code of (1.0, 1.0) instance we've seen before>`). At python side the other `Pickler` (and actually it is not a pyrolite `Pickler` but some Python lib), that serializes the output has nothing to do with the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787461888


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40137/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790090420


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40304/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788950105


   > correct me if I'm wrong: pickler recursively serializes the input and applies the cache. The input is a row of `(c1, c2)`, but pickler recursively serializes the row of `c1` and `c2`, and causes a problem because of the cache.
   
   You are right that caching has an important role in this issue. But IMO cache lookup by references can't cause issues if we use immutable objects. The issue here is that pytolite 4.21 introduced cache lookup by value and some of our data structures (`GenericRowWithSchema`) behaves weird when comparing them with `.equals()`...
   
   > Then I think it's not realistic to make one pickler instance to handle data with the same schema. Turning off `valueCompare` may be the only choice.
   
   Agreed, this looks like to be the easiest way to fix the issue. I've already modified this PR to revert previous changes and add `valueCompare=false`.
   
   I would argue though that `valueCompare=false` is not the only way to fix it. If `GenericRowWithSchema.equals()` took the schema into account in its `.equals()` then it would also fix it.
    
   > To evaluate the severity of the problem, it seems only an issue when there are nested struct types?
   
   Yes.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787500010


   **[Test build #135565 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135565/testReport)** for PR 31682 at commit [`d5ac2fe`](https://github.com/apache/spark/commit/d5ac2fe33ceab8f8fbf80c940736f89e67c460e5).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789551870


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40282/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787509799


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40146/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584483705



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/RowTest.scala
##########
@@ -108,6 +108,31 @@ class RowTest extends AnyFunSpec with Matchers {
     }
   }
 
+  describe("row not equals") {
+    val externalRow = Row(1)
+    val externalRow2 = Row(1.0)
+
+    it("inequality check for external rows") {
+      externalRow should not equal externalRow2
+    }

Review comment:
       Hmm, isn't this somehow arguable? Especially 1 == 1.0 => true.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790543591


   **[Test build #135751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135751/testReport)** for PR 31682 at commit [`279f536`](https://github.com/apache/spark/commit/279f536167387ab21a56883f4d8827df9b03aef3).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787463135


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40137/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [WIP][SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787527965






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787900034


   **[Test build #135595 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135595/testReport)** for PR 31682 at commit [`945b679`](https://github.com/apache/spark/commit/945b679bf476005fd69cb7e787bae372a1a12255).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787927493


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135595/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r586680236



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
##########
@@ -46,7 +46,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     val needConversion = dataTypes.exists(EvaluatePython.needConversionInPython)
 
     // enable memo iff we serialize the row with schema (schema and class should be memorized)
-    val pickle = new Pickler(needConversion)
+    val pickle = new Pickler(needConversion, false)

Review comment:
       No, cache (by reference) is not disabled. That is the first parameter of 'Pickler'. But if "cache by value" sounds confusing then I can rephrase the comment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790597093


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40334/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #31682: [WIP][SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787574429


   cc: @cloud-fan @viirya , too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787900034


   **[Test build #135595 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135595/testReport)** for PR 31682 at commit [`945b679`](https://github.com/apache/spark/commit/945b679bf476005fd69cb7e787bae372a1a12255).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787509798


   The error is:
   `Incorrect evaluation (codegen off): [true,-0.0] IN ([false,-0.0],[true,3.1807339850248914E35],null,null,[null,NaN],[true,-Infinity],[false,-1.7976931348623157E308],[true,0.0],[null,Infinity]), actual: true, expected: null`
   
   Looks like the issue is comparison of 0.0 to -0.0. Eh, I've forgotten if 0.0 should equal -0.0 in SQL? It looks like it's matching now, which is weird to me, because for example `new java.lang.Double(0.0).equals(new java.lang.Double(-0.0))` is false but `0.0 == -0.0` is true. We've moved to .equals() for comparison in this case (right?)
   
   Anyway it looks like it's expecting 'null' for this IN predicate, so maybe it's something slightly different. I don't quite see why. Maybe you know better than I do how to analyze this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787927493


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135595/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788889699


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40239/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789623045


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40283/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790634529


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135751/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787463128


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40137/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789659090


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135701/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789659090


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135701/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787895290


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135594/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [WIP][SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787527306


   **[Test build #135566 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135566/testReport)** for PR 31682 at commit [`6d66ef3`](https://github.com/apache/spark/commit/6d66ef3aab4db67af532ecdd9559e469b1c7ef05).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787892381


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40175/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788247450


   > My idea is to let one `Pickler` instance only handle data of the same schema.
   > 
   > IIUC the Python UDF operator needs to send the input (values of (c1, c2)) from JVM to Python, run the UDF, and send back the UDF result (values of (c3, c4)) from Python to JVM. Since the `Pickler` instance is used to serialize both the input and output data, the bug happens. Do I understand it correctly?
   
   No sorry, the issue is that the `Pickler` instance in JVM that serializes the input data `(c1, c2)` = `((1.0, 1.0), (1, 1))` serializes it as if it were `((1.0, 1.0), (1.0, 1.0))` (i.e. sends the serialized data something like `((1.0, 1.0), <some short (hash?) code of (1.0, 1.0) instance we've seen before>`). At python side the other `Pickler` (and actually it is not a pyrolite `Pickler` but some other Python lib), that serializes the output has nothing to do with the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788185220


   > > During serialization GenericRowWithSchema(1.0, 1.0) will be memoized first
   > 
   > Can we fix the problem if we use different `Pickler` instances for sending the input data to python and sending the UDF results to JVM?
   
   Not sure I get this. We don't use the same instance. How could we use the same instance? One lives in the JVM and the other in Python. We use `pyrolite` `Pickler` in the JVM. The `GenericRowWithSchema(1.0, 1.0)` and `GenericRowWithSchema(1, 1)` instances are the tuple values of `c1` and `c2` respectively. We send both of them from JVM to Python.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787958295


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40176/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r588085473



##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -78,7 +78,8 @@ private[spark] object SerDeUtil extends Logging {
    * Choose batch size based on size of objects
    */
   private[spark] class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
-    private val pickle = new Pickler()
+    private val pickle = new Pickler(/* useMemo = */ true,
+      /* valueCompare = */ false)

Review comment:
       nit. One-line may look better.
   ```scala
   -    private val pickle = new Pickler(/* useMemo = */ true,
   -      /* valueCompare = */ false)
   +    private val pickle = new Pickler(/* useMemo = */ true, /* valueCompare = */ false)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] ueshin commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

ueshin commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r586925305



##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -78,7 +78,7 @@ private[spark] object SerDeUtil extends Logging {
    * Choose batch size based on size of objects
    */
   private[spark] class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
-    private val pickle = new Pickler()
+    private val pickle = new Pickler(true, false)

Review comment:
       nit: I'd prefer to use named arguments for readability: `new Pickler(useMemo=true, valueCompare=false)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r586763984



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
##########
@@ -46,7 +46,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     val needConversion = dataTypes.exists(EvaluatePython.needConversionInPython)
 
     // enable memo iff we serialize the row with schema (schema and class should be memorized)
-    val pickle = new Pickler(needConversion)
+    val pickle = new Pickler(needConversion, false)

Review comment:
       I've added a short comment on cache by reference:  https://github.com/apache/spark/pull/31682/commits/bb312316f9de7739997d74d03dddcdb61bf74ec1.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787507397


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40146/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584847686



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/RowTest.scala
##########
@@ -108,6 +108,17 @@ class RowTest extends AnyFunSpec with Matchers {
     }
   }
 
+  describe("SPARK-34545: rows with different schema are not equal") {

Review comment:
       I've updated the PR description to focus on the correctness issue. I think it now clearly describes what is the issue with comparing `GenericRowWithSchema` objects.
   
   First I thought that this is a generic issue with `Row` that's why I tried to fix `.equals()` there. But that raised other compatibility issues (https://github.com/apache/spark/pull/31682#discussion_r584488733).
   
   Then the current PR fixes `.equals()` in `GenericRowWithSchema` only, but you can see that this also breaks a lots of UTs as many times in UTs we compare `GenericRowWithSchema` to `GenericRow` just to verify its content. 
   
   Are those UTs incorrect and shall we fix them? Or shall I give up this idea of fixing `.equals()` and we should just disable `valueCompare` in pyrolite: https://github.com/apache/spark/pull/31682#issuecomment-787904759 I've checked it, it also fixes the issue.
   
   I will add an end to end test definitely (Just need to figure out where we have python e2e test. Sorry I'm not familiar with them).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r588085701



##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -131,7 +132,8 @@ private[spark] object SerDeUtil extends Logging {
   }
 
   private def checkPickle(t: (Any, Any)): (Boolean, Boolean) = {
-    val pickle = new Pickler
+    val pickle = new Pickler(/* useMemo = */ true,
+      /* valueCompare = */ false)

Review comment:
       ditto.
   ```scala
   -    val pickle = new Pickler(/* useMemo = */ true,
   -      /* valueCompare = */ false)
   +    val pickle = new Pickler(/* useMemo = */ true, /* valueCompare = */ false)
   ```

##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -78,7 +78,8 @@ private[spark] object SerDeUtil extends Logging {
    * Choose batch size based on size of objects
    */
   private[spark] class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
-    private val pickle = new Pickler()
+    private val pickle = new Pickler(/* useMemo = */ true,
+      /* valueCompare = */ false)

Review comment:
       nit. One-line may look better.
   ```scala
   -    private val pickle = new Pickler(/* useMemo = */ true,
   -      /* valueCompare = */ false)
   +    private val pickle = new Pickler(/* useMemo = */ true, /* valueCompare = */ false)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787869572


   **[Test build #135594 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135594/testReport)** for PR 31682 at commit [`ea49b19`](https://github.com/apache/spark/commit/ea49b198a0944d3ee541ecf9cc530b8e90ca83de).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790151517


   **[Test build #135722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135722/testReport)** for PR 31682 at commit [`bb31231`](https://github.com/apache/spark/commit/bb312316f9de7739997d74d03dddcdb61bf74ec1).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789656904


   **[Test build #135701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135701/testReport)** for PR 31682 at commit [`94be320`](https://github.com/apache/spark/commit/94be320ad32645cc834ae9dd7f1156818c977a10).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787869572


   **[Test build #135594 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135594/testReport)** for PR 31682 at commit [`ea49b19`](https://github.com/apache/spark/commit/ea49b198a0944d3ee541ecf9cc530b8e90ca83de).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787509799


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40146/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31682: [WIP][SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787517417


   Hmm, I'm not sure that it is safe to change the behaviour of `Row.equals`, probably it would be better to change only `GenericRowWithSchema.equals` to take into account the `schema` as well. That would also fix the issue in the ticket.
   I've added WIP label to the PR and will look into this tomorrow...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787895051


   **[Test build #135594 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135594/testReport)** for PR 31682 at commit [`ea49b19`](https://github.com/apache/spark/commit/ea49b198a0944d3ee541ecf9cc530b8e90ca83de).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584488570



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/RowTest.scala
##########
@@ -108,6 +108,31 @@ class RowTest extends AnyFunSpec with Matchers {
     }
   }
 
+  describe("row not equals") {
+    val externalRow = Row(1)
+    val externalRow2 = Row(1.0)
+
+    it("inequality check for external rows") {
+      externalRow should not equal externalRow2
+    }

Review comment:
       Yes, I was thinking about this, Seq(1) == Seq(1.0) is also true.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788247450


   > My idea is to let one `Pickler` instance only handle data of the same schema.
   > 
   > IIUC the Python UDF operator needs to send the input (values of (c1, c2)) from JVM to Python, run the UDF, and send back the UDF result (values of (c3, c4)) from Python to JVM. Since the `Pickler` instance is used to serialize both the input and output data, the bug happens. Do I understand it correctly?
   
   No sorry, the issue is that the `Pickler` instance in JVM serializes the input data `(c1, c2)` = `((1.0, 1.0), (1, 1))` as if it were `((1.0, 1.0), (1.0, 1.0))` (i.e. sends the serialized data as something like `((1.0, 1.0), <some short (hash?) code of (1.0, 1.0) instance we've seen before>`). At python side the other `Pickler` (and actually it is not a pyrolite `Pickler` but some Python lib), that serializes the output has nothing to do with the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

cloud-fan edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788226385


   My idea is to let one `Pickler` instance only handle data of the same schema.
   
   IIUC the Python UDF operator needs to send the input (values of (c1, c2)) from JVM to Python, run the UDF, and send back the UDF result (values of (c3, c4)) from Python to JVM. Since the `Pickler` instance is used to serialize both the input and output data, the bug happens. Do I understand it correctly?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787895290


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135594/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #31682: [WIP][SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584397647



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/RowTest.scala
##########
@@ -108,6 +108,31 @@ class RowTest extends AnyFunSpec with Matchers {
     }
   }
 
+  describe("row not equals") {

Review comment:
       Can we add a JIRA ID?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789547522


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135700/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788887152


   correct me if I'm wrong: pickler recursively serializes the input and applies the cache. The input is a row of `(c1, c2)`, but pickler recursively serializes the row of `c1` and `c2`, and causes a problem because of the cache.
   
   Then I think it's not realistic to make one pickler instance to handle data with the same schema. Turning off `valueCompare` may be the only choice.
   
   To evaluate the severity of the problem, it seems only an issue when there are nested struct types?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r586680236



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
##########
@@ -46,7 +46,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     val needConversion = dataTypes.exists(EvaluatePython.needConversionInPython)
 
     // enable memo iff we serialize the row with schema (schema and class should be memorized)
-    val pickle = new Pickler(needConversion)
+    val pickle = new Pickler(needConversion, false)

Review comment:
       No, cache (by reference) is not disabled. That is the first parameter of `Pickler`. If "cache by value" sounds confusing then I can rephrase the comment.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
##########
@@ -46,7 +46,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     val needConversion = dataTypes.exists(EvaluatePython.needConversionInPython)
 
     // enable memo iff we serialize the row with schema (schema and class should be memorized)
-    val pickle = new Pickler(needConversion)
+    val pickle = new Pickler(needConversion, false)

Review comment:
       No, cache (by reference) is not disabled. That is the first parameter of `Pickler`.
   If "cache by value" sounds confusing then I can rephrase the comment.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788950105


   > correct me if I'm wrong: pickler recursively serializes the input and applies the cache. The input is a row of `(c1, c2)`, but pickler recursively serializes the row of `c1` and `c2`, and causes a problem because of the cache.
   
   You are right that caching has an important role in this issue. But IMO cache lookup by references can't cause issues if we use immutable objects. The issue here is that pytolite 4.21 introduced cache lookup by value and some of our data structures (`GenericRowWithSchema`) behaves weird when comparing them with `.equals()`...
   
   > Then I think it's not realistic to make one pickler instance to handle data with the same schema. Turning off `valueCompare` may be the only choice.
   
   Agreed, this looks like to be the easiest way to fix the issue now. I've already modified this PR to revert previous changes and add `valueCompare=false`.
   
   I would argue though that `valueCompare=false` is not the only way to fix it. If `GenericRowWithSchema.equals()` took the schema into account in its `.equals()` then it would also fix it.
    
   > To evaluate the severity of the problem, it seems only an issue when there are nested struct types?
   
   Yes.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r587348420



##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -78,7 +78,7 @@ private[spark] object SerDeUtil extends Logging {
    * Choose batch size based on size of objects
    */
   private[spark] class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
-    private val pickle = new Pickler()
+    private val pickle = new Pickler(true, false)

Review comment:
       fixed in https://github.com/apache/spark/pull/31682/commits/279f536167387ab21a56883f4d8827df9b03aef3, for some reason scalastyle doesn't like more than one of these in one line...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788950105


   > correct me if I'm wrong: pickler recursively serializes the input and applies the cache. The input is a row of `(c1, c2)`, but pickler recursively serializes the row of `c1` and `c2`, and causes a problem because of the cache.
   
   You are right that caching has an important role in this issue. But IMO cache lookup by references can't cause issues if we use immutable objects. The issue here is that pytolite 4.21 introduced cache lookup by value and some of our data structures (`GenericRowWithSchema`) behaves weird when comparing them with `.equals()`...
   
   > Then I think it's not realistic to make one pickler instance to handle data with the same schema. Turning off `valueCompare` may be the only choice.
   
   Agreed. I've already modified this PR to revert previous changes and add `valueCompare=false`.
    
   > To evaluate the severity of the problem, it seems only an issue when there are nested struct types?
   
   Yes.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788111973


   > This is because during serialization (1, 1) is replaced to (1.0, 1.0) and return type of c4 is expected to be IntegerType and if a different type comes back from python then it is discarded
   
   How is this related to the `valueCompare` flag? It casts all numeric to double?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788121131


   > > This is because during serialization (1, 1) is replaced to (1.0, 1.0) and return type of c4 is expected to be IntegerType and if a different type comes back from python then it is discarded
   > 
   > How is this related to the `valueCompare` flag? It casts all numeric to double?
   
   No. During serialization `GenericRowWithSchema(1.0, 1.0)` will be memoized first and when `GenericRowWithSchema(1, 1)` comes then it is replaced to some short code of the first one (instead of writing the second one out) as the they are equal and `valueCompare` is enabled.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790627182


   **[Test build #135751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135751/testReport)** for PR 31682 at commit [`279f536`](https://github.com/apache/spark/commit/279f536167387ab21a56883f4d8827df9b03aef3).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584490309



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala
##########
@@ -436,7 +436,7 @@ class DataFrameStatSuite extends QueryTest with SharedSparkSession {
     val df = spark.range(0, 100)
       .select(lit("Foo").as("name"), (col("id") % 3).as("key"))
     val sampled = df.stat.sampleBy(
-      struct($"name", $"key"), Map(Row("Foo", 0) -> 0.1, Row("Foo", 1) -> 0.2), 0L)
+      struct($"name", $"key"), Map(Row("Foo", 0L) -> 0.1, Row("Foo", 1L) -> 0.2), 0L)

Review comment:
       Yes, this where I realized that changeing to `.equals` is probably not a good idea: https://github.com/apache/spark/pull/31682#issuecomment-787517417




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790634529


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135751/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31682: [WIP][SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787510256


   **[Test build #135566 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135566/testReport)** for PR 31682 at commit [`6d66ef3`](https://github.com/apache/spark/commit/6d66ef3aab4db67af532ecdd9559e469b1c7ef05).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789551840


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40282/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788145063


   > > During serialization GenericRowWithSchema(1.0, 1.0) will be memoized first
   > 
   > Is this a global cache or per-query? This looks scary and I would like to disable it.
   
   Can be specified when we create a new `Pickler` (https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L119) and we create it per python evaluation: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L49 but there are some other places where we use `Pickler`. I think we haven't seen any issues with its cacheing just a the `valueCompare` that broke the things in Spark 3.0. 
   
   Shall I move this PR into this direction and disable `valueCompare`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790161043


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135722/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790161043


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135722/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787456233


   cc @HyukjinKwon 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787958295


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40176/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] ueshin commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

ueshin commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r586925305



##########
File path: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala
##########
@@ -78,7 +78,7 @@ private[spark] object SerDeUtil extends Logging {
    * Choose batch size based on size of objects
    */
   private[spark] class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
-    private val pickle = new Pickler()
+    private val pickle = new Pickler(true, false)

Review comment:
       ~nit: I'd prefer to use named arguments for readability: `new Pickler(useMemo=true, valueCompare=false)`.~
   nvm, as the library is in Java, we can't specify the names.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788247450


   > My idea is to let one `Pickler` instance only handle data of the same schema.
   > 
   > IIUC the Python UDF operator needs to send the input (values of (c1, c2)) from JVM to Python, run the UDF, and send back the UDF result (values of (c3, c4)) from Python to JVM. Since the `Pickler` instance is used to serialize both the input and output data, the bug happens. Do I understand it correctly?
   
   No sorry, the issue is that the `Pickler` instance in JVM serializes the input data `(c1, c2)` = `((1.0, 1.0), (1, 1))` as if it were `((1.0, 1.0), (1.0, 1.0))` (i.e. sends the serialized data as something like `((1.0, 1.0), <some short (hash?) code of (1.0, 1.0) instance we've seen before>`). At python side the other `Pickler` (and actually it is not a pyrolite `Pickler` but some Python lib), that serializes the output, has nothing to do with the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [WIP][SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787527965






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584847686



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/RowTest.scala
##########
@@ -108,6 +108,17 @@ class RowTest extends AnyFunSpec with Matchers {
     }
   }
 
+  describe("SPARK-34545: rows with different schema are not equal") {

Review comment:
       I've updated the PR description to focus on the correctness issue. I think it now clearly describes what is the issue with comparing `GenericRowWithSchema` objects.
   
   First I thought that this is a generic issue with `Row` that's why I tried to fix `.equals()` there. But that raised other compatibility issues (https://github.com/apache/spark/pull/31682#discussion_r584488733).
   
   Then the current PR fixes `.equals()` in `GenericRowWithSchema` only, but you can see that this also breaks a lots of UTs as many times in UTs we compare `GenericRowWithSchema` to `GenericRow` just to verify its content. 
   
   Are those UTs incorrect and shall we fix them? Or shall I give up this idea of fixing `.equals()` and we should just disable `valueCompare` in pyrolite: https://github.com/apache/spark/pull/31682#issuecomment-787904759 I've checked it, it also fixes the issue.
   
   I will add an end to end test definitely. (Just need to figure out where we have python e2e test. Sorry I'm not familiar with them.)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r588087266



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
##########
@@ -46,7 +46,18 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     val needConversion = dataTypes.exists(EvaluatePython.needConversionInPython)
 
     // enable memo iff we serialize the row with schema (schema and class should be memorized)
-    val pickle = new Pickler(needConversion)
+    // pyrolite 4.21+ can lookup objects in its cache by value, but `GenericRowWithSchema` objects,
+    // that we pass from JVM to Python, don't define their `equals()` to take the type of the
+    // values or the schema of the row into account. This causes like
+    // `GenericRowWithSchema(Array(1.0, 1.0),
+    //    StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType))))`
+    // and
+    // `GenericRowWithSchema(Array(1, 1),
+    //    StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType))))`
+    // to be `equal()` and so we need to disable this feature explicitly (`valueCompare=false`).
+    // Please note that cache by reference is still enabled depending on `needConversion`.
+    val pickle = new Pickler(/* useMemo = */ needConversion,
+      /* valueCompare = */ false)

Review comment:
       ```scala
   -    val pickle = new Pickler(/* useMemo = */ needConversion,
   -      /* valueCompare = */ false)
   +    val pickle = new Pickler(/* useMemo = */ needConversion, /* valueCompare = */ false)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584488221



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
##########
@@ -447,7 +447,7 @@ trait Row extends Serializable {
             if (d1.compareTo(o2.asInstanceOf[java.math.BigDecimal]) != 0) {
               return false
             }
-          case _ => if (o1 != o2) {
+          case _ => if (!o1.equals(o2)) {

Review comment:
       I was told that these 2 are the same in scala... seems they are different for primitive types.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787463135


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40137/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790121266


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40304/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-792399122


   Yeah, I agree with this change too. I thought I left some comments but seems I didn't ..


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788247450


   > My idea is to let one `Pickler` instance only handle data of the same schema.
   > 
   > IIUC the Python UDF operator needs to send the input (values of (c1, c2)) from JVM to Python, run the UDF, and send back the UDF result (values of (c3, c4)) from Python to JVM. Since the `Pickler` instance is used to serialize both the input and output data, the bug happens. Do I understand it correctly?
   
   No sorry, the issue is that the `Pickler` instance in JVM that serializes the input data `(c1, c2)` = `((1.0, 1.0), (1, 1))` serializes it as if it were `((1.0, 1.0), (1.0, 1.0))` (i.e. sends the serialized data as something like `((1.0, 1.0), <some short (hash?) code of (1.0, 1.0) instance we've seen before>`). At python side the other `Pickler` (and actually it is not a pyrolite `Pickler` but some Python lib), that serializes the output has nothing to do with the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787481739


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135556/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789594485


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40283/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r586770916



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
##########
@@ -46,7 +46,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     val needConversion = dataTypes.exists(EvaluatePython.needConversionInPython)
 
     // enable memo iff we serialize the row with schema (schema and class should be memorized)
-    val pickle = new Pickler(needConversion)
+    val pickle = new Pickler(needConversion, false)

Review comment:
       Thank you.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787904759


   > I think this is because Spark 2.4 uses pyrolite 4.13 where it looks the value comparison feature is missing?
   https://github.com/irmen/Pyrolite/blob/pyrolite-4.13/java/src/main/java/net/razorvine/pickle/Pickler.java#L63-L70 while in Spark 3.x we use pyrolite 4.30 that introduced? and enabled value comparison by default: https://github.com/irmen/Pyrolite/blob/pyrolite-4.30/java/src/main/java/net/razorvine/pickle/Pickler.java#L96-L122
   
   Oh, I see. Thanks for the answer. 
   
   > I think another option would be to disable the valueCompare explicitly.
   
   Yea, I think so, too. Could you try `valueCompare=false` then check performance changes just in case?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788145063


   > > During serialization GenericRowWithSchema(1.0, 1.0) will be memoized first
   > 
   > Is this a global cache or per-query? This looks scary and I would like to disable it.
   
   Can be specified when we create a new `Pickler` (https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L119) and we create it per python evaluation: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L49 but there are some other places where we use it...
   
   Shall I move this PR into this direction and disable `valueCompare`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-792736062


   Oh OK I need to put this in 3.1 then too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen closed pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

srowen closed pull request #31682:
URL: https://github.com/apache/spark/pull/31682


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r586230186



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
##########
@@ -46,7 +46,7 @@ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     val needConversion = dataTypes.exists(EvaluatePython.needConversionInPython)
 
     // enable memo iff we serialize the row with schema (schema and class should be memorized)
-    val pickle = new Pickler(needConversion)
+    val pickle = new Pickler(needConversion, false)

Review comment:
       Sure, added in https://github.com/apache/spark/pull/31682/commits/94be320ad32645cc834ae9dd7f1156818c977a10




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #31682: [WIP][SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787593248


   Why can spark-2.4 handle it correctly?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788226385


   My idea is to let one `Pickler` instance only handle data of the same schema.
   
   IIUC the Python UDF operator needs to send the input (values of (c1, c2)) to Python, run the UDF, and send back the UDF result (values of (c3, c4)). Since the `Pickler` instance is used to serialize both the input and output data, the bug happens. Do I understand it correctly?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788883607


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135658/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787493494


   **[Test build #135565 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135565/testReport)** for PR 31682 at commit [`d5ac2fe`](https://github.com/apache/spark/commit/d5ac2fe33ceab8f8fbf80c940736f89e67c460e5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787465869


   **[Test build #135556 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135556/testReport)** for PR 31682 at commit [`41dddb6`](https://github.com/apache/spark/commit/41dddb6da33db089c65d6418d9fb59d05226b19b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789547522


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135700/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787500106


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135565/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789568735


   **[Test build #135701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135701/testReport)** for PR 31682 at commit [`94be320`](https://github.com/apache/spark/commit/94be320ad32645cc834ae9dd7f1156818c977a10).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789623045


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40283/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787892381


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40175/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-790567386


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40334/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584488733



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala
##########
@@ -436,7 +436,7 @@ class DataFrameStatSuite extends QueryTest with SharedSparkSession {
     val df = spark.range(0, 100)
       .select(lit("Foo").as("name"), (col("id") % 3).as("key"))
     val sampled = df.stat.sampleBy(
-      struct($"name", $"key"), Map(Row("Foo", 0) -> 0.1, Row("Foo", 1) -> 0.2), 0L)
+      struct($"name", $"key"), Map(Row("Foo", 0L) -> 0.1, Row("Foo", 1L) -> 0.2), 0L)

Review comment:
       what happens if we don't add the suffix `L`? If it fails when it seems a breaking change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a change in pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

peter-toth commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584489967



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
##########
@@ -447,7 +447,7 @@ trait Row extends Serializable {
             if (d1.compareTo(o2.asInstanceOf[java.math.BigDecimal]) != 0) {
               return false
             }
-          case _ => if (o1 != o2) {
+          case _ => if (!o1.equals(o2)) {

Review comment:
       I found this: https://alexknvl.com/posts/eq.html




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787481739


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135556/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787927345


   **[Test build #135595 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135595/testReport)** for PR 31682 at commit [`945b679`](https://github.com/apache/spark/commit/945b679bf476005fd69cb7e787bae372a1a12255).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789548843


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40282/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31682: [SPARK-34545] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787500106


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135565/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788136544


   > During serialization GenericRowWithSchema(1.0, 1.0) will be memoized first
   
   Is this a global cache or per-query? This looks scary and I would like to disable it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-787873126


   @maropu
   > Why can [spark-2](https://issues.apache.org/jira/browse/SPARK-2).4 handle it correctly?
   
   I think this is because Spark 2.4 uses pyrolite 4.13 where it looks the value comparison feature is missing?
   https://github.com/irmen/Pyrolite/blob/pyrolite-4.13/java/src/main/java/net/razorvine/pickle/Pickler.java#L63-L70 while in Spark 3.x we use pyrolite 4.30 that introduced? and enabled value comparison by default: https://github.com/irmen/Pyrolite/blob/pyrolite-4.30/java/src/main/java/net/razorvine/pickle/Pickler.java#L96-L122
   
   I think another option would be to disable the `valueCompare` explicitly.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788121131


   > > This is because during serialization (1, 1) is replaced to (1.0, 1.0) and return type of c4 is expected to be IntegerType and if a different type comes back from python then it is discarded
   > 
   > How is this related to the `valueCompare` flag? It casts all numeric to double?
   
   No. During serialization `GenericRowWithSchema(1.0, 1.0)` will be memoized first and when `GenericRowWithSchema(1, 1)` comes then it is replaced to some short code of the first one (instead of writing the second out) as the they are equal and `valueCompare` is enabled.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth edited a comment on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788950105


   > correct me if I'm wrong: pickler recursively serializes the input and applies the cache. The input is a row of `(c1, c2)`, but pickler recursively serializes the row of `c1` and `c2`, and causes a problem because of the cache.
   
   You are right that caching has an important role in this issue. But IMO cache lookup by references can't cause issues if we use immutable objects. The issue here is that pytolite 4.21 introduced cache lookup by value and some of our data structures (`GenericRowWithSchema`) behaves weird when comparing them with `.equals()`...
   
   > Then I think it's not realistic to make one pickler instance to handle data with the same schema. Turning off `valueCompare` may be the only choice.
   
   Agreed, this looks like to be the easiest way to fix the issue. I've already modified this PR to revert previous changes and add `valueCompare=false`.
   
   I would argue that `valueCompare=false` is not the only way to fix it. If `GenericRowWithSchema.equals()` took the schema into account in its `.equals()` then it would also fix it.
    
   > To evaluate the severity of the problem, it seems only an issue when there are nested struct types?
   
   Yes.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

peter-toth commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788247450


   > My idea is to let one `Pickler` instance only handle data of the same schema.
   > 
   > IIUC the Python UDF operator needs to send the input (values of (c1, c2)) from JVM to Python, run the UDF, and send back the UDF result (values of (c3, c4)) from Python to JVM. Since the `Pickler` instance is used to serialize both the input and output data, the bug happens. Do I understand it correctly?
   
   No sorry, the issue is that the `Pickler` instance in JVM that serializes the input data `(c1, c2)` = `((1.0, 1.0), (1, 1))` serializes it as if it were `((1.0, 1.0), (1.0, 1.0))` (i.e. sends the serialized data something like `(1.0, 1.0), <some hash code of (1.0, 1.0) instance we've seen before>`). At python side the other `Pickler` (and actually it is not a pyrolite `Pickler` but some other Python lib), that serializes the output has nothing to do with the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #31682: [WIP][SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788171596


   > During serialization GenericRowWithSchema(1.0, 1.0) will be memoized first
   
   Can we fix the problem if we use different `Pickler` instances for sending the input data to python and sending the UDF results to JVM?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #31682: [SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-789517356


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #31682: [WIP][SPARK-34545][SQL] Fix row comparisons

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #31682:
URL: https://github.com/apache/spark/pull/31682#discussion_r584742586



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/RowTest.scala
##########
@@ -108,6 +108,17 @@ class RowTest extends AnyFunSpec with Matchers {
     }
   }
 
+  describe("SPARK-34545: rows with different schema are not equal") {

Review comment:
       can we add an end-to-end test to show the correctness bug? I'm still not very sure why we compare rows with different schema. Sounds like something we should forbid at the analysis phase.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org