Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/11/03 14:18:23 UTC

[GitHub] [spark] HyukjinKwon commented on a change in pull request #30203: [SPARK-33303][SQL] Deduplicate deterministic PythonUDF calls

HyukjinKwon commented on a change in pull request #30203:
URL: https://github.com/apache/spark/pull/30203#discussion_r516423241



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala
##########
@@ -218,13 +218,22 @@ object ExtractPythonUDFs extends Rule[LogicalPlan] with PredicateHelper {
     }
   }
 
+  private def canonicalizeDeterministic(u: PythonUDF) = {

Review comment:
       The fix itself makes sense logically. But my concern is that we set this deterministic flag to true by default for Python UDFs (and Scala UDFs too). When users just use UDFs without knowing about this flag (which I believe is arguably pretty common), UDFs they intended to behave non-deterministically will return the same answer after this fix.
   
   @gatorsmile and @cloud-fan, actually, shouldn't we set it to `false` by default? - I am reading https://github.com/apache/spark/commit/ebc24a9b7fde273ee4912f9bc1c5059703f7b31e. It's an arbitrary user-defined function, so I think it makes sense to use the loosest condition by default.
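
   To illustrate the concern, here is a minimal pure-Python sketch (not Spark code; the function name is hypothetical). If the optimizer assumes a UDF is deterministic and deduplicates repeated calls, a UDF the user intended to be non-deterministic silently returns one shared result:

```python
import itertools

# Stand-in for a non-deterministic UDF: each call yields a fresh value.
# (Imagine random.random() or uuid4(); a counter keeps the demo reproducible.)
_counter = itertools.count()

def non_deterministic_udf(x):
    return (x, next(_counter))

# Without deduplication, two references to udf(10) produce two different results,
# which is what the user of a non-deterministic UDF expects.
a = non_deterministic_udf(10)
b = non_deterministic_udf(10)
assert a != b

# If the optimizer deduplicates the call (because deterministic defaults to true),
# both references share a single evaluation and always agree.
dedup = non_deterministic_udf(10)
c, d = dedup, dedup
assert c == d
```

   In PySpark the user can opt out explicitly with `udf(...).asNondeterministic()`, but the point above is that users who never touch the flag get the deduplicated behavior by default.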

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org